{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5151afed",
   "metadata": {},
   "source": [
    "# 问答\n",
    "\n",
    "## 用例\n",
    "此处展示了如何使用 Langchian + 千帆 SDK 完成对特定文档完成获取、切分、转为向量并存储，而后根据你的提问来从文中获取答案。\n",
    "并且借助 Langsmith 将整个过程可视化展现\n",
    "\n",
    "## 概览\n",
    "把一个非结构化的文档转成问答链涉及以下步骤：\n",
    "1. `Loading`: 首先我们需要加载数据，非结构化的数据可以从多种渠道加载。点击 [LangChain integration hub](https://integrations.langchain.com/) 查看所有 Langchain 支持的 Loader。\n",
    "每个 Loader 都会返回 Langchian 中的 [`Document`](/docs/components/schema/document) 对象。\n",
    "\n",
    "2. `Splitting`: [文本切分器](/docs/modules/data_connection/document_transformers/) 把 `Documents` 切分成特定的大小。\n",
    "\n",
    "3. `Storage`: `Storage` （例如 [vectorstore](/docs/modules/data_connection/vectorstores/)）会将切分的数据储存起来，通常还附带对文本做 [embedding](https://www.pinecone.io/learn/vector-embeddings/) 。\n",
    "\n",
    "4. `Retrieval`: 用于从 `Storage` 中获取切分的数据，用于后面生成答案。\n",
    "\n",
    "5. `Generation`: 使用提示词和获取到的数据，搭配 [LLM](/docs/modules/model_io/models/llms/) 来生成回答。\n",
    "\n",
    "6. `Conversation` (扩展): 添加 [Memory](/docs/modules/memory/) 模块来在你的问答链上实现多轮对话。\n",
    "\n",
    "![flow.jpeg](img/qa_flow.jpeg)\n",
    "\n",
    "接下来我们会演示如何一步步构造我们自己的流水线，并且实现我们自己定制化的功能\n",
    "\n",
    "## Step 0. Prepare\n",
    "\n",
    "为了能够运行我们的 Demo，首先我们需要下载以下对应版本的依赖并且设置环境变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f71be487-126d-4424-a568-93d3f2f3f1ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install langchain\n",
    "!pip install chromadb\n",
    "!pip install qianfan\n",
    "!pip install pdfplumber"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "0668c799",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chromadb                      0.4.14\n",
      "langchain                     0.0.335\n",
      "pdfplumber                    0.10.3\n",
      "qianfan                       0.1.3\n"
     ]
    }
   ],
   "source": [
    "!pip list | grep -E 'qianfan|langchain|chromadb|pdfplumber'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "981930ef-a0b4-46f9-b60b-a495117ea38e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ['QIANFAN_AK'] = \"your_ak\"\n",
    "os.environ['QIANFAN_SK'] = \"your_sk\"\n",
    "\n",
    "# 此处为 Langsmith 相关功能开关。当且仅当你知道这是做什么用时，可删除注释并设置变量以使用 Langsmith 相关功能\n",
    "# os.environ['LANGCHAIN_TRACING_V2'] = \"true\"\n",
    "# os.environ['LANGCHAIN_ENDPOINT'] = \"https://api.smith.langchain.com\"\n",
    "# os.environ['LANGCHAIN_API_KEY'] = \"LANGCHAIN_API_KEY\"\n",
    "# os.environ['LANGCHAIN_PROJECT'] = \"LANGCHAIN_PROJECT\"\n",
    "\n",
    "\n",
    "is_chinese = True\n",
    "\n",
    "if is_chinese:\n",
    "    WEB_URL = \"https://zhuanlan.zhihu.com/p/85289282\"\n",
    "    CUSTOM_PROMPT_TEMPLATE = \"\"\"\n",
    "        使用下面的语料来回答本模板最末尾的问题。如果你不知道问题的答案，直接回答 \"我不知道\"，禁止随意编造答案。\n",
    "        为了保证答案尽可能简洁，你的回答必须不超过三句话。\n",
    "        请注意！在每次回答结束之后，你都必须接上 \"感谢你的提问\" 作为结束语\n",
    "        以下是一对问题和答案的样例：\n",
    "            请问：秦始皇的原名是什么\n",
    "            秦始皇原名嬴政。感谢你的提问。\n",
    "        \n",
    "        以下是语料：\n",
    "        \n",
    "        {context}\n",
    "        \n",
    "        请问：{question}\n",
    "    \"\"\"\n",
    "    QUESTION1 = \"明朝的开国皇帝是谁\"\n",
    "    QUESTION2 = \"朱元璋是什么时候建立的明朝\"\n",
    "else:\n",
    "    WEB_URL = \"https://lilianweng.github.io/posts/2023-06-23-agent/\"\n",
    "    CUSTOM_PROMPT_TEMPLATE = \"\"\"\n",
    "        Use the following pieces of context to answer the question at the end. \n",
    "        If you don't know the answer, just say that you don't know, don't try to make up an answer. \n",
    "        Use three sentences maximum and keep the answer as concise as possible. \n",
    "        Always say \"thanks for asking!\" at the end of the answer. \n",
    "        {context}\n",
    "        Question: {question}\n",
    "        Helpful Answer:\n",
    "    \"\"\"\n",
    "    QUESTION1 = \"How do agents use Task decomposition?\"\n",
    "    QUESTION2 = \"What are the various ways to implemet memory to support it?\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba5daed6",
   "metadata": {},
   "source": [
    "## Step 1. Load\n",
    "\n",
    "指定一个 `DocumentLoader` 来把你指定的非结构化数据加载成 `Documents`。一个 `Document` 是文字（即 `page_content`）和与之相关的元数据的结合体\n",
    "\n",
    "此处我们使用 `WebBaseLoader` ，从网页中加载一个 `Documents`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "cf4d5c72",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import WebBaseLoader\n",
    "\n",
    "loader = WebBaseLoader(WEB_URL)\n",
    "data = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3fa0645",
   "metadata": {},
   "source": [
    "Langchain 内提供了非常多样的 Loader ，辅助用户从不同来源读取数据。这些 Loader 都声明在 `langchain.document_loaders` 包中。\n",
    "\n",
    "对于我们的中文示例，我们还提供了一种从 PDF 读取 `Document` 的演示样例："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c079033",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install pdfplumber\n",
    "from langchain.document_loaders import PDFPlumberLoader\n",
    "\n",
    "if is_chinese:\n",
    "    loader = PDFPlumberLoader(\"example_data/中国古代史-明朝.pdf\")\n",
    "    data = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd2cc9a7",
   "metadata": {},
   "source": [
    "## Step 2. Split\n",
    "\n",
    "接下来把 `Document` 切分成块，为后续的 embedding 和存入向量数据库做准备。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "4b11c01d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size = 384, chunk_overlap = 0, separators=[\"\\n\\n\", \"\\n\", \" \", \"\", \"。\", \"，\"])\n",
    "all_splits = text_splitter.split_documents(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a33bd4d",
   "metadata": {},
   "source": [
    "## Step 3. Store\n",
    "\n",
    "为了能够查询文档的片段，我们首先需要把它们存储起来，一种比较常见的做法是对文档的内容做 embedding，然后再将 embedding 的向量连同文档一起存入向量数据库中，此处 embedding 用于索引文档。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e9c302c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import QianfanEmbeddingsEndpoint\n",
    "from langchain.vectorstores import Chroma\n",
    "\n",
    "vectorstore = Chroma.from_documents(documents=all_splits, embedding=QianfanEmbeddingsEndpoint())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc6f22b0",
   "metadata": {},
   "source": [
    "除了非结构化的文档以外，Langchain 还可以从多种数据源获取数据并将它们存储起来。\n",
    "\n",
    "![lc.png](img/qa_data_load.png)\n",
    "\n",
    "## Step 4. Retrieve\n",
    "\n",
    "我们可以使用 [相似度搜索](https://www.pinecone.io/learn/what-is-similarity-search/) 来从切分的文档内获取数据，获取到的数据会作为最终提交给 LLM 的 prompt 的一部分。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2c26b7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = vectorstore.similarity_search_with_relevance_scores(QUESTION1)\n",
    "[(document.page_content, score) for document, score in docs]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "415d6824",
   "metadata": {},
   "source": [
    "## Step 5. Generate\n",
    "\n",
    "接下来我们就可以使用我们的大模型（例如文心一言）和 Langchain 的 RetrievalQA 链，来针对这篇文档进行提问并获取我们想要的回答了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99fa1aec",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains import RetrievalQA\n",
    "from langchain.chat_models import QianfanChatEndpoint\n",
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "QA_CHAIN_PROMPT = PromptTemplate.from_template(CUSTOM_PROMPT_TEMPLATE)\n",
    "\n",
    "llm = QianfanChatEndpoint()\n",
    "retriever=vectorstore.as_retriever(search_type=\"similarity_score_threshold\", search_kwargs={'score_threshold': 0.0})\n",
    "                                   \n",
    "qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever, chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT})\n",
    "qa_chain({\"query\": QUESTION1})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7d52c84",
   "metadata": {},
   "source": [
    "在上面的执行中，我们使用了千帆平台上提供的大模型调用，成功就一个问进行了问答。\n",
    "\n",
    "此外，`RetrievalQA` 链中使用的 `prompt` 参数也是可以定制的。由于 Langchain `RetrievalQA` 链中默认提供的 prompt 是用英语编写的，所以此处我们替换为了我们手动实现的中文 prompt，针对中文语境进行优化。\n",
    "\n",
    "#### 使用不同的大模型\n",
    "\n",
    "\n",
    "除了使用默认的模型，即 `ERNIE-Bot-turbo` 以外，用户还可以设置上面 `QianfanChatEndpoint` 的 `model` 参数，来指定使用不同的大模型。例如我们想使用 ERNIE-Bot-4 模型时，就可以这么设置："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05f194c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = QianfanChatEndpoint(model=\"ERNIE-Bot-4\")\n",
    "\n",
    "qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever, chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT})\n",
    "qa_chain({\"query\": QUESTION1})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "214cfd40",
   "metadata": {},
   "source": [
    "或者如果你已经在千帆平台上购买了资源并部署了自己的大模型服务，千帆 Langchain 组件还提供了 `endpoint` 参数，让你能够在 Langchian 中调用自己微调的大模型。有条件的用户可以取消注释并修改下列代码进行体验。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8456042",
   "metadata": {},
   "outputs": [],
   "source": [
    "# llm = QianfanChatEndpoint(endpoint=\"your_service_endpoint\")\n",
    "\n",
    "# qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever, chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT})\n",
    "# qa_chain({\"query\": QUESTION1})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff40e8db",
   "metadata": {},
   "source": [
    "#### 返回源文档\n",
    "\n",
    "用于 QA 的知识文档也可以通过指定 `return_source_documents=True` 被包含在返回的字典里"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60004293",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains import RetrievalQA\n",
    "\n",
    "qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever, chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT}, return_source_documents=True)\n",
    "result = qa_chain({\"query\": QUESTION1})\n",
    "len(result['source_documents'])\n",
    "result['source_documents']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4380e478-e8ae-404b-9577-6b15475a6562",
   "metadata": {},
   "source": [
    "## Step 6. Chat\n",
    "\n",
    "我们还可以加入 `Memory` 模块并替换使用 `ConversationalRetrievalChain` 来实现记忆化的对话式查询。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f56838d-29a5-405f-a6ba-7b687ee56268",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.memory import ConversationSummaryMemory\n",
    "from langchain.chains import ConversationalRetrievalChain\n",
    "\n",
    "memory = ConversationSummaryMemory(llm=llm,memory_key=\"chat_history\",return_messages=True)\n",
    "qa = ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory, combine_docs_chain_kwargs={\"prompt\": QA_CHAIN_PROMPT})\n",
    "qa(QUESTION1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0d9d803-c183-49c8-97b0-1d5fed7ac52a",
   "metadata": {},
   "outputs": [],
   "source": [
    "qa(QUESTION2)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "vscode": {
   "interpreter": {
    "hash": "58f7cb64c3a06383b7f18d2a11305edccbad427293a2b4afa7abe8bfc810d4bb"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
